Introduction

library(GO.db)
library(graph)
library(dnet)
library(magrittr)
library(tidyverse)
library(pander)
library(scales)
library(plotly)
library(limma)

The aim of this page is to keep an up-to-date summary of the Gene Ontology database, as represented in the R package GO.db. The information contained for each GO term is:

The main object created is available for download here and this will load as a tibble after using the function read_rds() (fro the package readr)

Create the Basic Graphs

The package AnnotationDbi contains the function makeGOGraph which returns a graphNEL graph. In the following, we will:

  1. Create a graph for each ontology and remove the node "all" as this is essentially redundant
  2. Reverse the direction of the DAG for compatibility with tools in the package dnet. Whilst hosted on CRAN, this package Depends on the bioconductor package supraHex and should be installed using BiocManager::install("dnet")

Note that the following code can take several minutes to run as these are large graphs.

graphs <- c(BP = "bp", CC = "cc", MF = "mf") %>%
    lapply(makeGOGraph) %>%
    lapply(function(x){removeNode("all", x)}) %>%
    lapply(dDAGreverse)
Summary of graph sizes for each ontology
Ontology Nodes Edges
BP 29,699 71,745
CC 4,202 7,534
MF 11,148 13,681

Find the Key Information

goSummaries <- lapply(graphs, function(x){
    lng <- dDAGlevel(x, "longest_path") - 1
    shrt <- dDAGlevel(x, "shortest_path") - 1
    tips <- dDAGtip(x)
    tibble(
        id = unique(c(names(lng), names(shrt))),
        shortest_path = shrt,
        longest_path = lng,
        terminal_node = id %in% tips
        )
}) %>%
    bind_rows() %>%
    mutate(ontology = Ontology(id))
Path lengths for each ontology, based on whether a term is a terminal node or not.

Path lengths for each ontology, based on whether a term is a terminal node or not.

Cumulative number of GO Terms with paths $\geq$ x.

Cumulative number of GO Terms with paths \(\geq\) x.

Examples

In reality, we can just add this table to our GO analysis table from tools like goana() and use it to filter results before adjusting p-values.

As an example of how to use this to assist our decision making, if we chose to remove GO terms with a shortest path \(\leq\) 4, we can see how many terms we would keep and retain.

ggplotly(goSummaries %>%
    mutate(keep = shortest_path > 4) %>%
    ggplot(aes(keep, fill = terminal_node)) +
    geom_bar() +
    facet_wrap(~ontology, nrow = 1) +
    scale_y_continuous(labels = comma) +
    labs(x = "Term Retained", y = "Total") +
    theme_bw())
goSummaries %>%
    mutate(keep = shortest_path > 4) %>%
    group_by(ontology, terminal_node, keep) %>%
    tally() %>%
    spread(key = keep, value = n) %>%
    rename(Discard = `FALSE`,
           Retain = `TRUE`) %>%
    bind_rows(
        tibble(ontology = "**Total**",
               Discard = sum(.$Discard),
               Retain = sum(.$Retain))) %>%
    pander(big.mark = ",",
           justify = "llrr")
ontology terminal_node Discard Retain
BP FALSE 8,149 9,346
BP TRUE 4,200 8,004
CC FALSE 1,178 280
CC TRUE 2,093 651
MF FALSE 987 1,084
MF TRUE 2,142 6,935
Total NA 18,749 26,300

Alternatively, we could remove GO terms with a longest path back to the root node is \(\leq 5\) steps.

ggplotly(goSummaries %>%
    mutate(keep = longest_path > 5) %>%
    ggplot(aes(keep, fill = terminal_node)) +
    geom_bar() +
    facet_wrap(~ontology, nrow = 1) +
    scale_y_continuous(labels = comma) +
    labs(x = "Term Retained", y = "Total") +
    theme_bw())
goSummaries %>%
    mutate(keep = longest_path > 5) %>%
    group_by(ontology, terminal_node, keep) %>%
    tally() %>%
    spread(key = keep, value = n) %>%
    rename(Discard = `FALSE`,
           Retain = `TRUE`) %>%
    bind_rows(
        tibble(ontology = "**Total**",
               Discard = sum(.$Discard),
               Retain = sum(.$Retain))) %>%
    pander(big.mark = ",",
           justify = "llrr")
ontology terminal_node Discard Retain
BP FALSE 3,400 14,095
BP TRUE 889 11,315
CC FALSE 407 1,051
CC TRUE 562 2,182
MF FALSE 1,227 844
MF TRUE 5,837 3,240
Total NA 12,322 32,727

The summaries obtained above can be downloaded from here. Place this in the appropriate folder and the object can then be imported using read_rds("path/to/goSummaries.RDS")

Use with goana

As an artificial example, let’s use a subset of the genes associated with GO:0005795 (Golgi Stack) as a pretend set of DE genes.

library(org.Hs.eg.db)
set.seed(65)
myDE <- get("GO:0005795", org.Hs.egGO2ALLEGS) %>% sample(50)
goResults <- goana(myDE, species = "Hs")

If we leave these as is, there will be a few high=level GO terms which are relatively meaningless, as well as GO terms not present in our DE genes. These can easily be seen at the top of the Ancestor Chart for this term

goResults %>%
    rownames_to_column("id") %>%
    as_tibble() %>%
    arrange(P.DE) %>% 
    mutate(adjP = p.adjust(P.DE, "bonferroni"),
           FDR = p.adjust(P.DE, "fdr")) %>%
    slice(1:10) %>%
    pander(caption = "Unfiltered results for GO analysis",
           justify = "lllrrrrr",
           plain.ascii = TRUE,
           split.tables = Inf)
Unfiltered results for GO analysis
id Term Ont N DE P.DE adjP FDR
GO:0005795 Golgi stack CC 145 48 1.192e-107 2.716e-103 2.716e-103
GO:0031985 Golgi cisterna CC 113 36 7.431e-74 1.694e-69 8.468e-70
GO:0098791 Golgi subcompartment CC 841 48 8.913e-68 2.031e-63 6.771e-64
GO:0044431 Golgi apparatus part CC 957 48 5.209e-65 1.187e-60 2.968e-61
GO:0032580 Golgi cisterna membrane CC 92 29 7.237e-58 1.649e-53 3.299e-54
GO:0005794 Golgi apparatus CC 1533 48 5.454e-55 1.243e-50 2.072e-51
GO:0031984 organelle subcompartment CC 1674 48 3.967e-53 9.041e-49 1.292e-49
GO:0000139 Golgi membrane CC 741 40 2.57e-50 5.857e-46 7.321e-47
GO:0012505 endomembrane system CC 4398 48 8.281e-33 1.887e-28 2.097e-29
GO:0098588 bounding membrane of organelle CC 2047 40 1.344e-32 3.063e-28 3.063e-29

I usually like to rearrange mine as a tibble and do some filtering as well. In the following we’ll:

filteredGO <- goResults %>%
    rownames_to_column("id") %>%
    as_tibble() %>%
    arrange(P.DE) %>%
    left_join(goSummaries) %>%
    filter(DE > 0) %>%
    filter(shortest_path > 4) %>%
    mutate(adjP = p.adjust(P.DE, "bonferroni"),
           FDR = p.adjust(P.DE, "fdr"))

Now we have removed some of the redundant terms, we can just look at the top 10.

Top 10 terms for GO analysis after filtering
id Term Ont N DE P.DE adjP FDR
GO:0005795 Golgi stack CC 145 48 1.192e-107 4.933e-105 4.933e-105
GO:0031985 Golgi cisterna CC 113 36 7.431e-74 3.076e-71 1.538e-71
GO:0032580 Golgi cisterna membrane CC 92 29 7.237e-58 2.996e-55 9.986e-56
GO:0009101 glycoprotein biosynthetic process BP 326 20 7.288e-24 3.017e-21 7.543e-22
GO:0008417 fucosyltransferase activity MF 13 6 2.074e-13 8.588e-11 1.718e-11
GO:0000137 Golgi cis cisterna CC 28 7 2.875e-13 1.19e-10 1.984e-11
GO:0046920 alpha-(1->3)-fucosyltransferase activity MF 8 4 1.872e-09 7.751e-07 1.107e-07
GO:0019317 fucose catabolic process BP 9 4 3.364e-09 1.393e-06 1.393e-07
GO:0042354 L-fucose metabolic process BP 9 4 3.364e-09 1.393e-06 1.393e-07
GO:0042355 L-fucose catabolic process BP 9 4 3.364e-09 1.393e-06 1.393e-07

Or we could find which terminal nodes are significant, as these may be especially informative.

filteredGO %>%
    filter(adjP < 0.05, terminal_node) %>%
    dplyr::select(id, Term, Ont, N, DE, P.DE, adjP, FDR) %>%
    pander(caption = "Significant GO terms with no Child Terms",
           justify = "lllrrrrr",
           plain.ascii = TRUE,
           split.tables = Inf)
Significant GO terms with no Child Terms
id Term Ont N DE P.DE adjP FDR
GO:0042355 L-fucose catabolic process BP 9 4 3.364e-09 1.393e-06 1.393e-07
GO:0032588 trans-Golgi network membrane CC 90 5 2.198e-06 0.00091 6.067e-05
GO:0004065 arylsulfatase activity MF 12 3 2.647e-06 0.001096 6.848e-05
GO:0014846 esophagus smooth muscle contraction BP 2 2 5.417e-06 0.002243 0.0001246
GO:0017060 3-galactosyl-N-acetylglucosaminide 4-alpha-L-fucosyltransferase activity MF 3 2 1.623e-05 0.006718 0.0002921
GO:0008449 N-acetylglucosamine-6-sulfatase activity MF 3 2 1.623e-05 0.006718 0.0002921
GO:0018146 keratan sulfate biosynthetic process BP 28 3 3.838e-05 0.01589 0.0006356
GO:0003836 beta-galactoside (CMP) alpha-2,3-sialyltransferase activity MF 6 2 8.077e-05 0.03344 0.001115

Note that we just randomly grabbed 50 genes from GO:0005795, so this may also highlight issues with the general GO analytic approach and being cautious about low-level terms with only a small number of genes.

Session Info

R version 3.5.2 (2018-12-20)

**Platform:** x86_64-pc-linux-gnu (64-bit)

locale: LC_CTYPE=en_AU.UTF-8, LC_NUMERIC=C, LC_TIME=en_AU.UTF-8, LC_COLLATE=en_AU.UTF-8, LC_MONETARY=en_AU.UTF-8, LC_MESSAGES=en_AU.UTF-8, LC_PAPER=en_AU.UTF-8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_AU.UTF-8 and LC_IDENTIFICATION=C

attached base packages: parallel, stats4, stats, graphics, grDevices, utils, datasets, methods and base

other attached packages: org.Hs.eg.db(v.3.7.0), bindrcpp(v.0.2.2), limma(v.3.38.3), plotly(v.4.8.0), scales(v.1.0.0), pander(v.0.6.3), forcats(v.0.3.0), stringr(v.1.3.1), dplyr(v.0.7.8), purrr(v.0.2.5), readr(v.1.3.1), tidyr(v.0.8.2), tibble(v.2.0.0), ggplot2(v.3.1.0), tidyverse(v.1.2.1), magrittr(v.1.5), dnet(v.1.1.4), supraHex(v.1.20.0), hexbin(v.1.27.2), igraph(v.1.2.2), graph(v.1.60.0), GO.db(v.3.7.0), AnnotationDbi(v.1.44.0), IRanges(v.2.16.0), S4Vectors(v.0.20.1), Biobase(v.2.42.0) and BiocGenerics(v.0.28.0)

loaded via a namespace (and not attached): nlme(v.3.1-137), lubridate(v.1.7.4), bit64(v.0.9-7), httr(v.1.4.0), Rgraphviz(v.2.26.0), tools(v.3.5.2), backports(v.1.1.3), R6(v.2.3.0), DBI(v.1.0.0), lazyeval(v.0.2.1), colorspace(v.1.3-2), withr(v.2.1.2), tidyselect(v.0.2.5), bit(v.1.1-14), compiler(v.3.5.2), cli(v.1.0.1), rvest(v.0.3.2), Cairo(v.1.5-9), xml2(v.1.2.0), labeling(v.0.3), digest(v.0.6.18), rmarkdown(v.1.11), pkgconfig(v.2.0.2), htmltools(v.0.3.6), highr(v.0.7), htmlwidgets(v.1.3), rlang(v.0.3.0.1), readxl(v.1.2.0), rstudioapi(v.0.8), RSQLite(v.2.1.1), shiny(v.1.2.0), bindr(v.0.1.1), generics(v.0.0.2), jsonlite(v.1.6), crosstalk(v.1.0.0), Matrix(v.1.2-15), Rcpp(v.1.0.0), munsell(v.0.5.0), ape(v.5.2), stringi(v.1.2.4), yaml(v.2.2.0), MASS(v.7.3-51.1), plyr(v.1.8.4), grid(v.3.5.2), blob(v.1.1.1), promises(v.1.0.1), crayon(v.1.3.4), lattice(v.0.20-38), haven(v.2.0.0), hms(v.0.4.2), knitr(v.1.21), pillar(v.1.3.1), reshape2(v.1.4.3), glue(v.1.3.0), evaluate(v.0.12), data.table(v.1.11.8), modelr(v.0.1.2), httpuv(v.1.4.5.1), cellranger(v.1.1.0), gtable(v.0.2.0), assertthat(v.0.2.0), xfun(v.0.4), mime(v.0.6), xtable(v.1.8-3), broom(v.0.5.1), later(v.0.7.5), viridisLite(v.0.3.0) and memoise(v.1.1.0)